Bayesian Linear Regression

Analysis of Flight Delay Data

Author

Sara Parrish (Advisor: Dr.Β Seals)

Published

Oct 31, 2024

Slides: slides.html ( Go to slides.qmd to edit)

1 Introduction

Bayesian inference was initially developed by Reverend Thomas Bayes, but his ruminations on inverse probability wouldn’t be known until a friend polished and submitted his work to the Royal Society. Bayes’ work was eventually developed and refined by Laplace. Bayesian inference was wildly different from Fisher’s work in defining classical statistical theory (Lesaffre and Lawson 2012).

In opposition to the Bayesian approach is the frequentist approach. The frequentist approach considers the parameter of interest fixed and inference on the parameter is based on the result of repeated sampling. In the Bayesian approach, the parameter of interest is not fixed but stochastic, and inference on the parameter is based on observed data and prior knowledge (Lesaffre and Lawson 2012).

A benefit of the Bayesian approach lies in the ability to include prior knowledge through the selection of a prior. Priors can be subjective or objective. Subjective priors incorporate opinions, of an individual or of a group, which can negatively impact the perceived integrity of the findings. Objective priors are preferred which follow formal rules for determining noninformative priors (Lesaffre and Lawson 2012).

When prior knowledge is lackluster, has little information, or is otherwise not sufficient, a noninformative prior may be used. A common choice for a noninformative prior, in cases with binomial likelihood, is a uniform distribution, also known as a flat prior. In cases with a Gaussian likelihood, the noninformative prior would be a normal prior with \(\sigma^2 \to \infty\) which functions similarly to the flat prior. For cases with a Poisson likelihood, a Gamma(\(\alpha_0\), \(\beta_0\)) prior is used where the sum of the counts are \((\alpha_0 - 1)\) and the experiment size is \(\beta_0\) (Lesaffre and Lawson 2012). For normal linear regression models, conjugate normal-gamma priors can be used to provide a resulting posterior that is of the same family(Yan and Su 2009).

There are a variety of ways to summarize the posterior in order to derive conclusions about the parameter. Its location and variability can be specified by finding the mode, mean, and median of the distribution. Its range can be defined with the equal tail credible interval (not to be confused with the confidence interval in the frequentist approach) or with the high posterior density (HDP) interval. Future predictions for the parameter can be made through posterior predictive distributions (PPD) which factor out \(\theta\) with the assumption that past data will not influence future data, hierarchical independence(Lesaffre and Lawson 2012).

It is not uncommon for the marginal posterior density function of a parameter to be unavailable, requiring an alternate approach to extract insight. It is safe to assert that sampling techniques are used in nearly all Bayesian analyses(Yan and Su 2009). General purpose sampling algorithms available include the inverse ICDF method, the accept-reject algorithm, importance sampling, and the commonly used Monte Carlo integration. The Monte Carlo integration replaces the integral of the posterior with the average obtained from randomly sampled values to provide an expected value. Monte Carlo integration can be combined with the Markov property given by (Lesaffre and Lawson 2012) in the following equation

\[ p(\Theta^{(k+1)}| \Theta^k , \Theta^{(k-1)} , . . . , y) = p(\Theta^{(k+1)} | \Theta^k, y)\tag{2} \]

Two popular Markov chain Monte Carlo (MCMC) methods are the Gibbs sampler and the Metropolis-Hastings (MH) algorithm(Lesaffre and Lawson 2012).

2 Methods

2.1 Mathematical Foundations

2.1.1 The Frequentist Framework

Linear regression can be achieved using a variety of methods, two of interest are frequentist and Bayesian. The frequentist approach to linear regression is the more familiar approach. It estimates the effects of independent variables(predictors) on dependent variables(the outcome). The regression coefficient is a point estimate, assumed to be a fixed value. Following is the frequentist linear model

\[ Y = \beta_0 + \beta_1X + \varepsilon \tag{1} \]

  • \(Y\) : Dependent variable, the outcome
  • \(\beta_0\) : y intercept
  • \(\beta_1\) : The regression coefficient
  • \(X\) : Independent variable
  • \(\varepsilon\) : Random error (Yan and Su 2009)
  • \(\hat\beta\) provides a point estimate

2.1.2 The Bayesian Framework

The Bayesian approach estimates the relationship between predictors and an outcome in a similar way, however it’s regression coefficient is not a point estimate, but a distribution. That is, the regression coefficient is not assumed to be a fixed value. The Bayesian approach also goes a step further then frequentist regression in it’s inclusion of prior data. The Bayesian approach is so named because it is based on Bayes’ rule which is written as follows:

\[ Posterior = \frac{Likelihood \times Prior}{Normalization} \]

  • The \(Prior\) is model of prior knowledge on the subject
  • The \(Likelihood\) is the probability of the data given the prior
  • The \(Normalization\) is a constant that ensures the posterior distribution is a valid density function whose integration is equal to 1
  • The \(Posterior\) is the probability model that expresses an updated view of the model parameters
  • From the initial parameters of the prior

In terms of calculating probability, Bayes’ rule can be written as

\[ p(B|A) = \frac{p(A|B)\cdot p(B)}{p(A)} \tag{2} \]

  • Bayes’ rule allows for the calculation of inverse probability (\(p(B|A) \text{ from } p(A|B)\))
    • \(p(B|A) \text{ and } p(A|B)\) are conditional probabilities
    • \(p(A) \text{ and } p(B)\) are marginal probabilities (Lesaffre and Lawson 2012)

For continuous parameters, Bayes rule can be written as

\[ \begin{align*} p(\theta|y) =& \frac{ L(\theta|y)p(\theta) }{p(y)}\\ \\ p(\theta|y) \propto & \text{ }L(\theta|y)p(\theta) \end{align*} \]

The normalization constant (\(p(y)\) above) ensures the posterior distribution is a valid distribution, but the posterior density function can be written without this constant. The resulting prediction is not a point estimate, but a distribution (Bayes 1763). The Bayesian approach is derived with Bayes’ theorem wherein the posterior distribution, the updated belief about the parameter given the data \(p(\theta|y)\), is proportional to the likelihood of \(\theta\) given \(y\), \(L(\theta|y)\), and the prior density of \(\theta\), \(p(\theta)\). The former is known as the likelihood function and would comprise the new data for analysis while the latter allows for the incorporation of prior knowledge regarding \(\theta\)(Yan and Su 2009).

\[ f(\theta|D) \propto L(\theta|D)f(\theta) \tag{1} \]

2.1.3 The Model

To generate a model for our analysis, we start with the normal data model \(Y_i|\beta_0, \beta_1, \sigma \sim N(\mu, \sigma^2)\) and include a the mean specific to our predictor, departure time, \(\mu_i\). The model is:

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \end{align*} \] Where: - \(Y_i\) is the arrival delay for the i-th flight - \(X_i\) is the departure delay for the i-th flight - \(\mu_i = \beta_0 + \beta_1X_i\) is the local mean arrival delay, , specific to the departure time - \(\sigma^2\) is the variance of the errors - \(\overset{\text{ind}}{\sim}\) indicates conditional independence of each arrival delay with the given parameters

2.1.4 Prior Selection

Since we are only using two data variables, arrival delay and departure time, the regression parameters will be \(\beta_0\), \(\beta_1\), and \(\sigma\) for intercept, slope, and error As intercept and slope regression parameters can take any real value, we will use normal prior models (Johnson, Ott, and Dogucu 2022). \[ \begin{align*} \beta_0 &\sim N(m_0, s^2_0)\\ \beta_1 &\sim N(m_1, s^2_1) \end{align*} \]

where \(m_0, s_0, m_1, \text{and } s_1\) are hyperparameters.

The standard deviation parameter must be positive, so we will use an exponential model (Johnson, Ott, and Dogucu 2022).

\[ \sigma \sim \text{Exp}(l) \]

Due to the fact that the exponential model is a special case of the Gamma model, with \(s = 1\), we can use the definitions of the mean and variance of the gamma model to to find that of the exponential model (Johnson, Ott, and Dogucu 2022).

\[ \begin{align*} E(\sigma) = \frac{1}{l} && \text{and} && SD(\sigma) = \frac{1}{l} \end{align*} \] #### 2.1.5 Tuning of Priors

Tip

Flat vs default vs tuned priors

2.1.6 The Bayesian Linear Regression Model

The model can be written as

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_1 &\sim N(m_1, s_1^2)\\ \sigma &\sim \text{Exp}(l) \end{align*} \]

2.2 Assumptions ***

Tip

β€œNormal regression assumptions

The appropriateness of the Bayesian Normal regression model (9.3) depends upon the following assumptions.

  • Structure of the data
    • Accounting for predictor \(X\), the observed data \(Y_i\) on case \(i\) is independent of the observed data on any other case \(j\).
  • Structure of the relationship
    • The typical \(Y\) outcome can be written as a linear function of predictor X, \(ΞΌ = Ξ²_0 + Ξ²_1X\).
  • Structure of the variability
    • At any value of predictor \(X\), the observed values of \(Y\) will vary normally around their average \(ΞΌ\) with consistent standard deviation \(Οƒ\).”

2.3 Statistical Programming

Data was collected by the Bureau of Transportation Statistics (BTS) and accessed through a dataset compiled by Patrick Zelazko (Zelazko 2023). The data was imported into R (R Core Team 2023) via CSV. This is a large time-series dataset with with 3 million observations, each a specific flight, and 32 features. The data is from flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the time in minutes and attributed reasons for delay. The function stan_glm() was used for simulation of the Normal Bayesian linear regression model from the β€œrstanarm” library(Brilleman et al. 2018). This function runs the Markov Chain Monte Carlo simulation as well with specified chains, iterations, and the ability to set a seed. These were set to 4 chains, 2000 iterations, and the seed was set to 123. Simulation of the posterior was done with the posterior_predict() function, also from the β€œrstanarm” library(Brilleman et al. 2018). Evaluation of the model was done by considering the data and it’s source, the assumptions of the model, and the accuracy of the prediction. The posterior predictions were evaluated with the prediction_summary() function from the β€œbayesrules”library (Dogucu, Johnson, and Ott 2021). This provided median absolute error (MAE) scaled MAE, and the proportion of values that fall within 50% and 95% confidence intervals.

Tip

Possibly k-fold cross validation, model averaging

3 Analysis and Results

3.1 The Dataset

This is a large dataset with with 3 million observations, each a specific flight, and 32 features. The data is from flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the time in minuted and attributed reasons for delay.

3.2 Exploratory Data Analysis

Following are the definitions of the given variables in this dataset.

Header Description
Fl Date Flight Date (yyyy-mm-dd)
Airline Airline Name
Airline DOT Airline Name and Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
Airline Code Unique Carrier Code
DOT Code An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
Fl Number Flight Number
Origin Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
Origin City Origin City Name, State Code
Dest Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
Dest City Destination City Name, State Code
CRS Dep Time CRS Departure Time (local time: hhmm)
Dep Time Actual Departure Time (local time: hhmm)
Dep Delay Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
Taxi Out Taxi Out Time, in Minutes
Wheels Off Wheels Off Time (local time: hhmm)
Wheels On Wheels On Time (local time: hhmm)
Taxi In Taxi In Time, in Minutes
CRS Arr Time CRS Arrival Time (local time: hhmm)
Arr Time Actual Arrival Time (local time: hhmm)
Arr Delay Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
Cancelled Cancelled Flight Indicator (1=Yes)
Cancellation Code Specifies The Reason For Cancellation
Diverted Diverted Flight Indicator (1=Yes)
CRS Elapsed Time CRS Elapsed Time of Flight, in Minutes
Actual Elapsed Time Elapsed Time of Flight, in Minutes
Air Time Flight Time, in Minutes
Distance Distance between airports (miles)
Carrier Delay Carrier Delay, in Minutes
Weather Delay Weather Delay, in Minutes
NAS Delay National Air System Delay, in Minutes
Security Delay Security Delay, in Minutes
Late Aircraft Delay Late Aircraft Delay, in Minutes
Code
Table1.2_total <- Delays1 %>%
  summarise(
    .by = NULL,
    flight_period = "Total",
    TotalFlights = n(),
    TotalUniqueDates = n_distinct(fl_date),
    TotalUniqueOrigins = n_distinct(origin),
    TotalUniqueDestinations = n_distinct(dest),
    AvgCRSDepTime = mean(crs_dep_time, na.rm = TRUE),
    AvgDepTime = mean(dep_time, na.rm = TRUE),
    AvgDepDelay = round(mean(dep_delay, na.rm = TRUE), 2),
    AvgTaxiOut = round(mean(taxi_out, na.rm = TRUE), 2),
    AvgTaxiIn = round(mean(taxi_in, na.rm = TRUE), 2),
    AvgCRSArrTime = mean(crs_arr_time, na.rm = TRUE),
    AvgArrTime = mean(arr_time, na.rm = TRUE),
    AvgArrDelay = round(mean(arr_delay, na.rm = TRUE), 2),
    AvgAirTime = round(mean(air_time, na.rm = TRUE), 2),
    CancelledFlights = sum(cancelled, na.rm = TRUE),
    DivertedFlights = sum(diverted, na.rm = TRUE), 
    AvgCarrierDelay = round(mean(carrier_delay, na.rm = TRUE), 2),
    AvgSecurityDelay = round(mean(security_delay, na.rm = TRUE), 2),
    AvgWeatherDelay = round(mean(weather_delay, na.rm = TRUE), 2),
    AvgNASDelay = round(mean(nas_delay, na.rm = TRUE), 2),
    AvgLateAircraftDelay = round(mean(lateaircraft_delay, na.rm = TRUE), 2),
    CarrierDelay_ct = sum(carrier_delay > 0),
    SecurityDelay_ct = sum(security_delay > 0),
    WeatherDelay_ct = sum(weather_delay > 0),
    NASDelay_ct = sum(nas_delay > 0),
    LateAircraftDelay_ct = sum(lateaircraft_delay > 0)) %>%
  mutate(
    TotalFlightsCount = sprintf("%d (100%%)", TotalFlights),
    CancelledFlightsCount = sprintf("%d (100%%)", CancelledFlights),
    DivertedFlightsCount = sprintf("%d (100%%)", DivertedFlights),
    CarrierDelayCount = sprintf("%d (100%%)", CarrierDelay_ct),
    SecurityDelayCount = sprintf("%d (100%%)", SecurityDelay_ct),
    WeatherDelayCount = sprintf("%d (100%%)", WeatherDelay_ct),
    NASDelayCount = sprintf("%d (100%%)", NASDelay_ct),
    LateAircraftDelayCount = sprintf("%d (100%%)", LateAircraftDelay_ct)
  )

Table1.2_combined <- bind_rows(Table1.2, Table1.2_total)



library(lubridate)

# Converting time HHMM.SS to HH:MM:SS

convert_to_time <- function(time_val) {
  rounded_time <- round(time_val, 2)
  hours <- floor(rounded_time / 100)
  minutes_with_secs <- (rounded_time %% 100)
  minutes <- floor(minutes_with_secs)
  seconds <- round((minutes_with_secs - minutes) * 60, 0)
  time_formatted <- sprintf("%02d:%02d:%02d", hours, minutes, seconds)
  return(time_formatted)
}

#Apply time conversion, remove extra rows

Table1.3_combined <- Table1.2_combined %>%
  mutate(
    AvgCRSDepTime = sapply(AvgCRSDepTime, convert_to_time),
    AvgDepTime = sapply(AvgDepTime, convert_to_time),
    AvgCRSArrTime = sapply(AvgCRSArrTime, convert_to_time),
    AvgArrTime = sapply(AvgArrTime, convert_to_time),
  ) %>%
  mutate(across(-flight_period, as.character)
  ) %>%
  select(
    flight_period,
    TotalFlightsCount,
    CancelledFlightsCount,
    DivertedFlightsCount,
    AvgCRSDepTime,
    AvgDepTime,
    AvgDepDelay,
    AvgTaxiOut,
    AvgTaxiIn,
    AvgCRSArrTime,
    AvgArrTime,
    AvgArrDelay,
    AvgAirTime,
    CarrierDelayCount,
    SecurityDelayCount,
    WeatherDelayCount,
    NASDelayCount,
    LateAircraftDelayCount
  )

#Pivot table

Table1.3_pivoted <- Table1.3_combined %>% 
  pivot_longer(
    cols = -flight_period,
    names_to = "Statistic", 
    values_to = "Value") %>% 
  pivot_wider(
    names_from = flight_period,
    values_from = Value
)

#gt Table1

Table1.3_pivoted %>%
  gt() %>%
  tab_header(
    title = "Flight Delay Summary by Flight Period"
  ) %>%
  cols_label(
    Statistic = "Flight Period",
    Morning = "Morning",
    Afternoon = "Afternoon",
    Evening = "Evening",
    Total = "Total"
  ) %>%
  tab_spanner(
    label = "Flight Period",
    columns = c(Morning, Afternoon, Evening, Total)
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white"), 
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_body(
      columns = everything()
    )
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white"),
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_column_labels(
      columns = everything()
    )
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white", weight = "bold"),
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_title(
      groups = c("title", "subtitle")
    )
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white", weight = "bold"),
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_column_spanners(
      spanners = everything()
    )
  ) %>%
  tab_source_note(
    source_note = "Table 1: Summary includes morning, afternoon, and evening flight periods."
  ) %>%
  tab_style(
    style = list(
      cell_text(color = "white"), 
      cell_fill(color = "rgba(0, 43, 54, 1)")
    ),
    locations = cells_source_notes()
  )
Flight Delay Summary by Flight Period
Flight Period
Flight Period
Morning Afternoon Evening Total
TotalFlightsCount 1246031 (41.5%) 1423140 (47.4%) 330829 (11.0%) 3000000 (100%)
CancelledFlightsCount 30690 (38.8%) 38343 (48.4%) 10107 (12.8%) 79140 (100%)
DivertedFlightsCount 2555 (36.2%) 3901 (55.3%) 600 (8.5%) 7056 (100%)
AvgCRSDepTime 08:49:31 15:73:19 20:66:23 13:27:04
AvgDepTime 08:53:58 15:89:05 20:12:40 13:29:47
AvgDepDelay 5.23 12.93 16.51 10.12
AvgTaxiOut 16.87 16.44 16.65 16.64
AvgTaxiIn 7.75 7.78 6.95 7.68
AvgCRSArrTime 10:87:15 17:85:11 17:42:14 14:90:34
AvgArrTime 10:86:01 17:71:56 15:89:47 14:66:31
AvgArrDelay -0.77 7.34 10.04 4.26
AvgAirTime 114.12 109.8 116.31 112.31
CarrierDelayCount 86824 (29.2%) 162266 (54.6%) 47861 (16.1%) 296951 (100%)
SecurityDelayCount 887 (32.1%) 1434 (52.0%) 438 (15.9%) 2759 (100%)
WeatherDelayCount 8380 (26.7%) 18758 (59.7%) 4290 (13.7%) 31428 (100%)
NASDelayCount 80604 (31.4%) 144366 (56.3%) 31507 (12.3%) 256477 (100%)
LateAircraftDelayCount 42721 (16.5%) 168902 (65.2%) 47391 (18.3%) 259014 (100%)
Table 1: Summary includes morning, afternoon, and evening flight periods.

The three flight periods are each comprised of 8-hour segments (i.e.Β Morning has flights with departure times from 4am to noon followed by afternoon and evening). The Afternoon period is comprised of the most flights (47.4%), followed closely by the Morning period (41.5%), and the Evening period trails the two (11%). The table also gives the means of the departure and arrival times, giving an indication of the density of the flights in the given period. The average departure and arrival delays show much better numbers for the Morning period (5.23, -0.77 minutes) with increasing delays for the Afternoon and Evening periods. The delay counts by type show That the Afternoon and Morning periods account for significantly more of the total delays, though that is without taking into account the smaller contribution of flights by the Evening period on the whole.

Some Visualizations of the Dataset

These histograms illustrate the frequencies of air time, arrival delays, and departure delays. The y-axis was transformed to make the visualizations more legible. All show a skew to the right. This makes sense for air times with a higher proportion of regional flights and the exclusion of international departures and arrivals. Shorter delays (for both arrivals and departures) being more frequent than longer delays is also to be expected.

This visualization shows the average arrival delay for the largest five airlines (filtered for carriers with over 200,000 flights in the given period). The standard deviations for these airlines are fairly small, indicating a low variability in the arrival delays for these airlines.

This heat map shows the average arrival delay for flights at their origin airport. This comes from the idea that if a flight is delayed at departure, then it may also be delayed on arrival at it’s destination.

Code
p_value_long <- melt(p_value_results, na.rm = TRUE)  # Remove NAs

ggplot(data = p_value_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "blue", high = "red", na.value = "gray90", name = "p-value") +
  labs(x = "Categorical Variable", y = "Continuous Variable", title = "Heatmap of p-values for Categorical vs Continuous Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4 Modeling

4.1 Data Preprocessing

  • Removal of Cancelled and Diverted Flights
  • Departure Time HHMM -> Minutes past Midnight (function() from)
Code
ggplot(Delays_sample, aes(x = DEP_TIME)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(caption="Distribution of Departure Time",
       x = "Departure Time (time HHMM)",
       y = "Frequency") +
  xlim(NA, 2400)

Code
convert_to_minutes <- function(time) {
  hour <- time %/% 100
  minute <- time %% 100
  total_minutes <- hour * 60 + minute
  return(total_minutes)
}

Delays_sample <- Delays_sample %>%
  mutate("DEP_TIME_MINS" = sapply(DEP_TIME, convert_to_minutes))
Code
ggplot(Delays_sample, aes(x = DEP_TIME_MINS)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(caption = "Distribution of Departure Time",
        x = "Departure Time (mins past midnight)",
        y = "Frequency") +
  xlim(NA, 1440)

Code
Delays_sample1 <- Delays_sample %>%
  mutate(FL_DATE = as.Date(FL_DATE, format = "%Y-%m-%d"))

ggplot(Delays_sample1, aes(x = FL_DATE)) +
  geom_histogram(binwidth = 30, fill = "skyblue", color = "black") +
  labs(
    caption = "Flight Counts by Date",
    x = "Flight Date",
    y = "Count"
  ) +
  theme_minimal()

Code
library(lubridate)

Delays_sample <- Delays_sample %>%
  mutate(DAY_OF_WEEK = wday(FL_DATE, label = TRUE, abbr = TRUE))


mean_delay_by_day <- Delays_sample %>%
  group_by(DAY_OF_WEEK) %>%
  summarise(mean_arr_delay = mean(ARR_DELAY),
            sd_arr_delay = sd(ARR_DELAY)
  )
Code
ggplot(Delays_sample, aes(x = DAY_OF_WEEK, y = ARR_DELAY)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = mean_delay_by_day, aes(x = DAY_OF_WEEK, y = mean_arr_delay), 
             color = "red4", size =3, shape = 8) +
  labs(
    caption = "Arrival Delay by Day of the Week",
    x = "Day of the Week",
    y = "Arrival Delay (minutes)"
  ) +
  theme_minimal()+
  ylim(-45,45)

4.2 The Normal Data Model: Departure Time Predictor

For the continuous predictor, Departure Time, three normal models with different priors are sampled: - Flat Priors - Rstanarm’s Default Priors - Tuned Priors

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_1 &\sim N(m_1, s_1^2)\\ \sigma &\sim \text{Exp}(l) \end{align*} \]

4.2.1 Flat Continuous Model

Code
library(broom.mixed)
library(rstanarm)

flat_model_dt <- stan_glm(ARR_DELAY ~ DEP_TIME_MINS,
                       data = Delays_sample,
                       family = gaussian(),
                       prior = NULL,
                       prior_intercept = NULL,
                       prior_aux = NULL,
                       chains = 4, iter = 2000, seed = 123,
                       refresh = 0
                       )

fmdt <- tidy(flat_model_dt, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

# Simulate a set of predictions
set.seed(123)
shortcut_prediction1 <- 
  posterior_predict(flat_model_dt, newdata = data.frame(DEP_TIME_MINS = 720))

# 95% posterior credible interval
posterior_interval(shortcut_prediction1, prob = 0.95)
       2.5%    97.5%
1 -97.93328 103.4754
Code
mcmc_dens(shortcut_prediction1) + 
  xlab("Predicted Arrival Delays for a Departure Time of Noon, Flat Priors")

Code
# MCMC diagnostics

model <- flat_model_dt

mcmc_trace(model, size = 0.1)

Code
mcmc_dens_overlay(model)

Code
mcmc_acf(model)

4.2.2 Default Continuous Model

Code
library(rstanarm)

default_model_dt <- stan_glm(ARR_DELAY ~ DEP_TIME_MINS,
                       data = Delays_sample,
                       family = gaussian(),
                       chains = 4, iter = 2000, seed = 123,
                       refresh = 0
                       )

dmdt <- tidy(default_model_dt, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

# Simulate a set of predictions
set.seed(123)
shortcut_prediction2 <- 
  posterior_predict(default_model_dt, newdata = data.frame(DEP_TIME_MINS = 720))

# 95% posterior credible interval
posterior_interval(shortcut_prediction2, prob = 0.95)
       2.5%    97.5%
1 -97.82046 103.2472
Code
mcmc_dens(shortcut_prediction2) + 
  xlab("Predicted Arrival Delays for a Departure Time of Noon, Default Priors")

Code
# MCMC diagnostics

model <- default_model_dt

mcmc_trace(model, size = 0.1)

Code
mcmc_dens_overlay(model)

Code
mcmc_acf(model)

4.2.3 Tuned Continuous Model

\(\beta_0\) informs the model intercept
Code
summary(Delays_sample$DEP_TIME_MINS) #mean departure time is 809.3 minutes (~ 1:30pm)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   556.0   802.0   809.3  1058.0  1440.0 
Code
Delays_sample_filtered_B0 <- subset(Delays_sample, DEP_TIME_MINS >= 800 & DEP_TIME_MINS <= 820)

mean(Delays_sample_filtered_B0$ARR_DELAY) #m_0c = 2
[1] 2.123885
Code
sd(Delays_sample_filtered_B0$ARR_DELAY)  #s_0c = 36
[1] 36.41549

\(\beta_{0c}\) reflects the typical arrival delay at a typical departure time. With a mean departure time at \(\sim\) 1:30pm, the average arrival delay is \(\sim\) 2 minutes with a standard deviation \(\sim\) 36 minutes.

\[ \beta_{0c} \sim N(2, 36^2) \]

\(\beta_1\) informs the model slope
Code
lm_model <- lm(ARR_DELAY ~ DEP_TIME_MINS, data = Delays_sample)

summary(lm_model)

Call:
lm(formula = ARR_DELAY ~ DEP_TIME_MINS, data = Delays_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
 -93.78  -19.89  -10.11    3.05 2900.37 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.092e+01  4.718e-01  -23.14   <2e-16 ***
DEP_TIME_MINS  1.903e-02  5.466e-04   34.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51.09 on 97101 degrees of freedom
Multiple R-squared:  0.01233,   Adjusted R-squared:  0.01232 
F-statistic:  1213 on 1 and 97101 DF,  p-value: < 2.2e-16
Code
coef(lm_model)["DEP_TIME_MINS"] #m_1 = 0.01903
DEP_TIME_MINS 
   0.01903357 
Code
summary(lm_model)$coefficients["DEP_TIME_MINS", "Std. Error"] #s_1 = 0.0005
[1] 0.0005465741

The slope of the linear model indicates a 0.019 minute increase in arrival delay per minute increase in departure time, so we set \(m_1 = 0.02\). The standard error reflects high confidence at 0.0005, but as to not limit the model we will set it lower at \(s_1 = 0.01\).

\[ \beta_{1} \sim N(0.02, 0.01^2) \]

\(\sigma\) informs the regression standard deviation
Code
summary(lm_model)$sigma
[1] 51.0923

To tune the exponential model, we set the expected value of the standard deviation, $ E() $, equal to the residual standard error, \(\sim 50\). With this, we can find the rate parameter, \(l\).

\[ \begin{align*} E(\sigma) &= \frac{1}{l} = 50\\\\ l &= \frac{1}{50} = 0.02\\\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]

\[ \begin{align*} Y_i|\beta_0, \beta_1, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_i \\ \beta_{0} &\sim N(2, 36^2)\\ \beta_1 &\sim N(0.02, 0.01^2)\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]

Code
library(rstanarm)

tuned_model_dt <- stan_glm(ARR_DELAY ~ DEP_TIME_MINS, 
                        data = Delays_sample,
                        family = gaussian(),
                        prior_intercept = normal(2, 1296),
                        prior = normal(0.02, 0.0001), 
                        prior_aux = exponential(0.02),
                        chains = 4, iter = 2000, seed = 123,
                        refresh = 0
                        )

# Effective sample size ratio and Rhat
neff_ratio(tuned_model_dt)
  (Intercept) DEP_TIME_MINS         sigma 
      0.64225       0.70875       1.04350 
Code
rhat(tuned_model_dt)
  (Intercept) DEP_TIME_MINS         sigma 
     1.000360      0.999620      1.000415 
Code
# Trace plots of parallel chains
mcmc_trace(tuned_model_dt, size = 0.1)

Code
# Density plots of parallel chains
mcmc_dens_overlay(tuned_model_dt)

Code
library(broom.mixed)

tmdt <- tidy(tuned_model_dt, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

# Store the 4 chains for each parameter in 1 data frame
model_tuned_df <- as.data.frame(tuned_model_dt)


##TUNED
# Simulate a set of predictions
set.seed(123)
shortcut_prediction <- 
  posterior_predict(tuned_model_dt, newdata = data.frame(DEP_TIME_MINS = 720))

# Construct a 95% posterior credible interval
posterior_interval(shortcut_prediction, prob = 0.95)
       2.5%   97.5%
1 -98.19376 103.554
Code
mcmc_dens(shortcut_prediction) + 
  xlab("Predicted Arrival Delays for a Departure Time of Noon, Tuned Priors")

Code
# MCMC diagnostics

model <- tuned_model_dt

mcmc_trace(model, size = 0.1)

Code
mcmc_dens_overlay(model)

Code
mcmc_acf(model)

4.3 The Normal Data Model: Week Day Predictor

For arrival delays by the day of the week, the mean arrival delays are between 1 and 7 minutes while the median arrival delays are all in the negative, indicating a skew towards larger delays.

\[ \begin{align*} Y_i|\beta_0, \beta_1, ... \beta_6, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + ... \beta_6X_{i6} \\ \beta_{0} &\sim N(m_0, s_0^2)\\ \beta_1 &\sim N(m_1, s_1^2)\\ \sigma &\sim \text{Exp}(l) \end{align*} \]

4.3.1 Flat Categorical Model

Code
Delays_sample$DAY_OF_WEEK <- factor(
  Delays_sample$DAY_OF_WEEK, 
  ordered = FALSE)

Delays_sample <- Delays_sample %>%
  mutate(DAY_OF_WEEK = relevel(as.factor(DAY_OF_WEEK), ref = "Tue"))

flat_model_dow <- stan_glm(
  ARR_DELAY ~ DAY_OF_WEEK, 
  data = Delays_sample, 
  family = gaussian(),
  prior = NULL, 
  prior_intercept = NULL, 
  prior_aux = NULL,
  chains = 4, iter = 2000, seed = 123,
  refresh = 0
)

model <- flat_model_dow

fmdow <- tidy(model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

library(ggplot2)
library(ggridges)
library(dplyr)

#Stacked distributions

new_data <- data.frame(DAY_OF_WEEK = levels(Delays_sample$DAY_OF_WEEK))
predictions <- posterior_predict(model, newdata = new_data)

pred_df <- as.data.frame(predictions)
colnames(pred_df) <- levels(Delays_sample$DAY_OF_WEEK)

pred_long <- pred_df %>%
  pivot_longer(cols = everything(), names_to = "DAY_OF_WEEK", values_to = "ARR_DELAY")


pred_long$DAY_OF_WEEK <- factor(pred_long$DAY_OF_WEEK, levels = c("Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"))


ggplot(pred_long, aes(x = ARR_DELAY, y = DAY_OF_WEEK, fill = DAY_OF_WEEK)) +
  geom_density_ridges(alpha = 0.7, scale = 0.8) +
  stat_summary(fun = mean, 
               geom = "vline", 
               aes(xintercept = ..x.., color = DAY_OF_WEEK),
               linetype = "dashed", linewidth = 0.5, show.legend = FALSE) +
  labs(
    title = "Posterior Predicted Distribution of Arrival Delay by Day of the Week",
    x = "Predicted Arrival Delay (minutes)",
    y = "Day of the Week",
    fill = "Day"
  ) +
  xlim(-150,150)+
  theme_minimal()

Flat Categorical Model
Code
# MCMC diagnostics

model <- flat_model_dow

mcmc_trace(model, size = 0.1)

Code
mcmc_dens_overlay(model)

Code
mcmc_acf(model)

4.3.2 Default Categorical Model

Code
default_model_dow <- stan_glm(
  ARR_DELAY ~ DAY_OF_WEEK, 
  data = Delays_sample, 
  family = gaussian(),
  chains = 4, iter = 2000, seed = 123,
  refresh = 0
)


model <- default_model_dow

dmdow <- tidy(model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

library(ggplot2)
library(ggridges)
library(dplyr)

#Stacked distributions

new_data <- data.frame(DAY_OF_WEEK = levels(Delays_sample$DAY_OF_WEEK))
predictions <- posterior_predict(model, newdata = new_data)

pred_df <- as.data.frame(predictions)
colnames(pred_df) <- levels(Delays_sample$DAY_OF_WEEK)

pred_long <- pred_df %>%
  pivot_longer(cols = everything(), names_to = "DAY_OF_WEEK", values_to = "ARR_DELAY")


pred_long$DAY_OF_WEEK <- factor(pred_long$DAY_OF_WEEK, levels = c("Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"))

ggplot(pred_long, aes(x = ARR_DELAY, y = DAY_OF_WEEK, fill = DAY_OF_WEEK)) +
  geom_density_ridges(alpha = 0.7, scale = 0.8) +
  stat_summary(fun = mean, 
               geom = "vline", 
               aes(xintercept = ..x.., color = DAY_OF_WEEK),
               linetype = "dashed", linewidth = 0.5, show.legend = FALSE) +
  labs(
    title = "Posterior Predicted Distribution of Arrival Delay by Day of the Week",
    x = "Predicted Arrival Delay (minutes)",
    y = "Day of the Week",
    fill = "Day"
  ) +
  xlim(-150,150)+
  theme_minimal()

Code
# MCMC diagnostics

model <- default_model_dow

mcmc_trace(model, size = 0.1)

Code
mcmc_dens_overlay(model)

Code
mcmc_acf(model)

4.3.3 Tuned Categorical Model

\(\beta_0\) informs the model intercept
# A tibble: 7 Γ— 3
  DAY_OF_WEEK mean_arr_delay sd_arr_delay
  <ord>                <dbl>        <dbl>
1 Sun                   6.50         53.8
2 Mon                   4.24         52.3
3 Tue                   1.55         45.6
4 Wed                   3.43         52.5
5 Thu                   5.03         47.5
6 Fri                   5.96         52.5
7 Sat                   4.52         55.1

\(\beta_{0}\) reflects the mean arrival delay on Tuesday, our reference. The average arrival delay is \(\sim\) 2 minutes with a standard deviation \(\sim\) 46 minutes.

\[ \beta_{0} \sim N(2, 46^2) \]

\(\beta_j\) informs the model slopes

For a categorical predictor with the stan_glm() function, the tuned prior, \(\beta_j\), is applied to to the estimation of each coefficient associated with the individual levels of the predictor ($_1, _2, …, _6 $). For this reason, we set the coefficient prior to be weakly informative.

\[ \beta_{j} \sim N(0, 50^2) \]

\(\sigma\) informs the regression standard deviation
Code
lm_model <- lm(ARR_DELAY ~ DAY_OF_WEEK, data = Delays_sample)

summary(lm_model)$sigma
[1] 51.38906

To tune the exponential model, we set the expected value of the standard deviation, $ E() $, equal to the residual standard error which is the same as with the previous model, \(\sim 50\).

\[ \begin{align*} E(\sigma) &= \frac{1}{l} = 50\\\\ l &= \frac{1}{50} = 0.02\\\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]

The tuned model is as follows,

\[ \begin{align*} Y_i|\beta_0, \beta_1, ... \beta_6, \sigma &\overset{\text{ind}}{\sim} N (\mu_i, \sigma^2) && \text{with } && \mu_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + ... \beta_6X_{i6} \\ \beta_{0} &\sim N(2, 46^2)\\ \beta_j &\sim N(0, 50^2)\\ \sigma &\sim \text{Exp}(0.02) \end{align*} \]

Code
tuned_model_dow <- stan_glm(
  ARR_DELAY ~ DAY_OF_WEEK, 
  data = Delays_sample, 
  family = gaussian(),
  prior = normal(2,2116), 
  prior_intercept = normal(0,2500), 
  prior_aux = exponential(0.02),
  chains = 4, iter = 2000, seed = 123,
  refresh = 0
)


model <- tuned_model_dow

tmdow <- tidy(model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.95)

library(ggplot2)
library(ggridges)
library(dplyr)

#Stacked distributions

new_data <- data.frame(DAY_OF_WEEK = levels(Delays_sample$DAY_OF_WEEK))
predictions <- posterior_predict(model, newdata = new_data)

pred_df <- as.data.frame(predictions)
colnames(pred_df) <- levels(Delays_sample$DAY_OF_WEEK)

pred_long <- pred_df %>%
  pivot_longer(cols = everything(), names_to = "DAY_OF_WEEK", values_to = "ARR_DELAY")


pred_long$DAY_OF_WEEK <- factor(pred_long$DAY_OF_WEEK, levels = c("Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"))

ggplot(pred_long, aes(x = ARR_DELAY, y = DAY_OF_WEEK, fill = DAY_OF_WEEK)) +
  geom_density_ridges(alpha = 0.7, scale = 0.8) +
  stat_summary(fun = mean, 
               geom = "vline", 
               aes(xintercept = ..x.., color = DAY_OF_WEEK),
               linetype = "dashed", linewidth = 0.5, show.legend = FALSE) +
  labs(
    title = "Posterior Predicted Distribution of Arrival Delay by Day of the Week",
    x = "Predicted Arrival Delay (minutes)",
    y = "Day of the Week",
    fill = "Day"
  ) +
  xlim(-150,150)+
  theme_minimal()

Code
# MCMC diagnostics

model <- tuned_model_dow

mcmc_trace(model, size = 0.1)

Code
mcmc_dens_overlay(model)

Code
mcmc_acf(model)

Mean SD 95% CI
Continuous


Flat Model
𝛽₀ Intercept -10.92 0.47 (-11.85; 0.02)
𝛽₁ Departure Time 0.02 0.00 (0.02; 50.86)
𝜎 51.09 0.12 (50.86; -11.86)


Default Tuned Model
𝛽₀  Intercept -10.94 0.47 (-11.86; 0.02)
𝛽₁ Departure Time 0.02 0.00 (0.02; 50.86)
𝜎 51.09 0.12 (50.86; -12.02)


Tuned Model
𝛽₀ Intercept -11.66 0.17 (-12.02; 0.02)
𝛽₁ Departure Time 0.02 0.00 (0.02; 50.87)
𝜎 51.10 0.12 (50.87; 0.00)
Categorical







Flat Model
𝛽₀ Intercept 1.57 0.44 (0.69; 3.72)
𝛽₁ Wednesday 4.92 0.61 (3.72; 1.47)
𝛽₂ Thursday 2.66 0.65 (1.47; 0.69)
𝛽₃ Friday 1.86 0.62 (0.69; 2.25)
𝛽₄ Saturday 3.43 0.61 (2.25; 3.21)
𝛽₅ Sunday 4.36 0.61 (3.21; 1.72)
𝛽₆ Monday 2.93 0.65 (1.72; 51.16)
𝜎 51.39 0.12 (51.16; 0.73)







Default Tuned Model
𝛽₀ Intercept 1.54 0.40 (0.73; 3.21)
𝛽₁ Wednesday 4.40 0.60 (3.21; 1.48)
𝛽₂ Thursday 2.69 0.60 (1.48; 1.75)
𝛽₃ Friday 2.97 0.64 (1.75; 3.77)
𝛽₄ Saturday 4.96 0.59 (3.77; 2.31)
𝛽₅ Sunday 3.48 0.58 (2.31; 0.74)
𝛽₆ Monday 1.89 0.60 (0.74; 51.17)
𝜎 51.40 0.12 (51.17; 0.66)







Tuned Model
𝛽₀ Intercept 1.54 0.44 (0.66; 3.76)
𝛽₁ Wednesday 4.96 0.64 (3.76; 1.48)
𝛽₂ Thursday 2.70 0.61 (1.48; 0.71)
𝛽₃ Friday 1.91 0.64 (0.71; 2.28)
𝛽₄ Saturday 3.50 0.63 (2.28; 3.23)
𝛽₅ Sunday 4.41 0.61 (3.23; 1.73)
𝛽₆ Monday 2.99 0.64 (1.73; 51.17)
𝜎 51.39 0.12 (51.17; 0.00)
  • Present your key findings in a clear and concise manner.

  • Use visuals to support your claims.

  • Tell a story about what the data reveals.

4.4 Evaluation of the Model

Tip

All notes past this point are for future reference

Consideration the Dataset

  • how was data collected, by who and for what purpose, whaty impact might the results have, what biases are baked in to the analyses (scope of the data)

Checking Model Assumptions

β€œNormal regression assumptions

The appropriateness of the Bayesian Normal regression model (9.3) depends upon the following assumptions.

  • Structure of the data
    • Accounting for predictor \(X\), the observed data \(Y_i\) on case \(i\) is independent of the observed data on any other case \(j\).
  • Structure of the relationship
    • The typical \(Y\) outcome can be written as a linear function of predictor X, \(ΞΌ = Ξ²_0 + Ξ²_1X\).
  • Structure of the variability
    • At any value of predictor \(X\), the observed values of \(Y\) will vary normally around their average \(ΞΌ\) with consistent standard deviation \(Οƒ\).”

Assumption 1: https://www.bayesrulesbook.com/chapter-10#:~:text=Though%20formal%20hypothesis%20tests%20for%20assumption%201%20do%20exist%2C%20we%20can%20typically%20evaluate%20its%20appropriateness%20by%20the%20data%20collection%20context

Assumption 2 & 3: scatterplot of raw data, pp_check

# Examine 50 of the 20000 simulated samples
pp_check(bike_model, nreps = 50) + 
  xlab("rides")

Accuracy of Posterior Predictive Models



set.seed(84735)
predictions <- posterior_predict(bike_model, newdata = bikes)
dim(predictions)


library(bayesplot)

#ppc_intervals() function in the bayesplot package provides a quick visual summary of the 500 approximate posterior predictive models stored in predictions
ppc_intervals(bikes$rides, yrep = predictions, x = bikes$temp_feel, 
              prob = 0.5, prob_outer = 0.95)


library(bayesrules)

# Posterior predictive summaries
set.seed(84735)
prediction_summary(bike_model, data = bikes) # get MAE, MAE scaled, obs values that fall within 50 and 95 % posterior prediction interval   

5 Conclusion

  • Summarize your key findings.

  • Discuss the implications of your results.

References

Bayes, T. 1763. β€œAn Essay Towards Solving a Problem in the Doctrine of Chances. 1763.”
Brilleman, SL, MJ Crowther, M Moreno-Betancur, J Buros Novik, and R Wolfe. 2018. β€œJoint Longitudinal and Time-to-Event Models via Stan.” https://github.com/stan-dev/stancon_talks/.
Dogucu, Mine, Alicia Johnson, and Miles Ott. 2021. Bayesrules: Datasets and Supplemental Functions from Bayes Rules! Book. https://github.com/bayes-rules/bayesrules.
Grolemund, Garrett, and Hadley Wickham. 2011. β€œDates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Johnson, Alicia A, Miles Q Ott, and Mine Dogucu. 2022. Bayes Rules!: An Introduction to Bayesian Modeling with R. Chapman & Hall.
Lesaffre, Emmanuel, and Andrew B Lawson. 2012. Bayesian Biostatistics. 1st ed. Somerset: John Wiley & Sons, Ltd. https://doi.org/https://doi.org/10.1002/9781119942412.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Yan, Xin, and Xiao Gang Su. 2009. Linear Regression Analysis: Theory and Computing. Singapore: World Scientific Publishing. https://ebookcentral.proquest.com/lib/uwf/reader.action?docID=477274&ppg=318&pq-origsite=primo.
Zelazko, Patrick. 2023. β€œFlight Delay and Cancellation Dataset (2019-2023).” https://www.kaggle.com/datasets/patrickzel/flight-delay-and-cancellation-dataset-2019-2023.